Background backend liveness


Monitor: View in Axiom
Service: backend-background (AWS) / backend (GCP)


Overview

This monitor tracks the self-reported heartbeat from the background backend service. A triggered alert means no heartbeats have been received in the last 2 minutes for the alerted region.

Architecture note: We operate two distinct backend services:

  • backend — API-facing service
  • backend-background — responsible for background tasks, queue processing, pollers, and scheduled jobs

In AWS, these are deployed as separate services. In GCP, they run as a single unified backend deployment.

When this monitor fires, the cause is one of two things:

  • A momentary blip — transient disruption, usually self-resolving
  • An ongoing outage — backend-background is down, or the heartbeat reporting mechanism has failed

🔍 Step 1: Determine if the Issue is Momentary or Ongoing

Open the monitor in Axiom and examine the heartbeat history for the alerted region.

  • If heartbeats resumed on their own → this was a transient issue. Document the event and continue monitoring. No further action required.
  • If heartbeats are still missing → proceed to Step 2.

🩺 Step 2: Assess Impact

2a. Are Pollers Running?

Run the following query in Axiom to determine whether background jobs are still executing in the affected region:

```
traces
| where ['service.name'] == "legion-backend"
| where ['resource.deployment.environment'] == "production"
| where name == "run_job"
```
| Poller traces present? | Likely cause | Next step |
| --- | --- | --- |
| ✅ Yes | Service is running, but heartbeat reporting has failed | → Step 3: Investigate the heartbeat code path |
| ❌ No | backend-background may be down entirely | → Step 4: Investigate the service health |
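The routing above can be encoded as a tiny helper (hypothetical, purely to make the decision explicit):

```python
def next_step(poller_traces_present: bool) -> str:
    """Map the poller-trace check to the next runbook step."""
    if poller_traces_present:
        # Service is executing jobs; only heartbeat reporting is broken.
        return "Step 3: investigate the heartbeat code path"
    # No job traces at all: the service itself is suspect.
    return "Step 4: investigate the service health"
```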

2b. Are Queue Readers Healthy?

Check the health of the message queues for the affected region:

  • AWS — Navigate to SQS in the relevant region and inspect the queues for message backlog, age of oldest message, and consumer activity.
  • GCP — Navigate to Pub/Sub and check the relevant subscriptions for undelivered message count and consumer activity.

A growing backlog with no consumer activity is a strong signal that the backend-background service is down or stuck.
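That "growing backlog, no consumers" signal can be expressed as a simple predicate over two backlog samples. A sketch (hypothetical helper — in practice you would read these values from the SQS/Pub/Sub console or metrics, not compute them locally):

```python
def queue_looks_stuck(backlog_then: int, backlog_now: int, active_consumers: int) -> bool:
    """True when messages are accumulating and nothing is reading the queue.

    backlog_then/backlog_now: message counts from two samples a few minutes apart.
    active_consumers: number of consumers currently receiving from the queue.
    """
    return backlog_now > backlog_then and active_consumers == 0
```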


🔧 Step 3: Heartbeat Flow Investigation (Pollers Running, Heartbeats Missing)

⚠ïļ Reaching this step is uncommon. The heartbeat() function is a simple while loop — a single log write followed by a sleep — so there is little room for error. If pollers are running but heartbeats are missing, the most likely explanation is a stalled or blocked event loop rather than a bug in the heartbeat logic itself. This likely requires deeper runtime investigation (e.g., event loop analysis, thread/async inspection).


🚨 Step 4: backend-background Service Is Down

If no poller traces are found, the backend-background service is likely unhealthy or fully down.

Check the service health for the relevant region:

  • AWS — Navigate to the backend-background service in ECS and inspect running task count, recent failures, container restarts, and CPU/memory utilization.
  • GCP — Check the backend Cloud Run service for instance health and recent errors. Note: GCP runs a single unified backend deployment with no separate backend-background.

If the service is clearly unhealthy, restart it following the Restarting Backend Service guide.

⚠ïļ Restarting restores availability — but you still need to understand why it went down. Continue to Step 5.


🪵 Step 5: Examine Logs

Regardless of the resolution path, investigate logs around the incident timeframe to identify the root cause.

See View Container Logs for instructions on accessing logs in both AWS and GCP.

In both cases, look for:

  • Exceptions or stack traces
  • OOM (out of memory) kills
  • Service crashes or restart loops
  • Connectivity errors (DB, queue, downstream services)

Use your findings to document the root cause, and open a follow-up ticket if a code fix, capacity change, or dependency investigation is needed.
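The checklist above can be scripted as a first-pass triage over exported log lines. A sketch (the patterns are illustrative guesses — adapt them to the actual log format of your containers):

```python
import re

# Signal categories from the checklist above; patterns are assumptions.
SIGNALS = {
    "exception": re.compile(r"Traceback|Exception|ERROR", re.IGNORECASE),
    "oom": re.compile(r"OOMKilled|Out of memory|oom-kill", re.IGNORECASE),
    "restart": re.compile(r"restart|exited with code", re.IGNORECASE),
    "connectivity": re.compile(r"connection (refused|reset|timed out)", re.IGNORECASE),
}

def triage(lines: list[str]) -> dict[str, list[str]]:
    """Group suspicious log lines by signal type for a quick first pass."""
    hits: dict[str, list[str]] = {name: [] for name in SIGNALS}
    for line in lines:
        for name, pattern in SIGNALS.items():
            if pattern.search(line):
                hits[name].append(line)
    return hits
```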